NSF PAR Search | NSF Public Access Repository


DReX: Accurate and Scalable Dense Retrieval Acceleration via Algorithmic-Hardware Codesign

https://doi.org/10.1145/3695053.3731079

Quinn, Derrick; Yücel, E Ezgi; Prammer, Martin; Fan, Zhenxing; Skadron, Kevin; Patel, Jignesh M; Martínez, José F; Alian, Mohammad (June 2025, ACM)

Free, publicly-accessible full text available June 20, 2026
Accelerating Retrieval-Augmented Generation

https://doi.org/10.1145/3669940.3707264

Quinn, Derrick; Nouri, Mohammad; Patel, Neel; Salihu, John; Salemi, Alireza; Lee, Sukhan; Zamani, Hamed; Alian, Mohammad (March 2025, ACM)

An evolving solution to address hallucination and enhance accuracy in large language models (LLMs) is Retrieval-Augmented Generation (RAG), which involves augmenting LLMs with information retrieved from an external knowledge source, such as the web. This paper profiles several RAG execution pipelines and demystifies the complex interplay between their retrieval and generation phases. We demonstrate that while exact retrieval schemes are expensive, they can reduce inference time compared to approximate retrieval variants because an exact retrieval model can send a smaller but more accurate list of documents to the generative model while maintaining the same end-to-end accuracy. This observation motivates the acceleration of the exact nearest neighbor search for RAG. In this work, we design Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators. IKS offers 13.4–27.9× faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7–26.3× lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used for other applications running on the server to prevent DRAM, which is the most expensive component in today's servers, from being stranded.
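For readers unfamiliar with the term, the exact nearest neighbor search named in the abstract above can be illustrated as an exhaustive scan that scores every stored vector against the query and keeps the best k. The following is a minimal sketch only. The function name exact_top_k, the toy vectors, and the use of inner-product similarity are assumptions made for illustration; none of them come from the paper, and the sketch says nothing about how IKS itself is implemented.

```python
import heapq

def exact_top_k(query, corpus, k):
    """Exhaustively score every vector in corpus against query.

    Hypothetical helper for illustration: similarity is the inner product
    (an assumption), and no index or approximation is used, so the result
    is exact by construction.
    """
    scored = (
        (sum(q * d for q, d in zip(query, doc)), idx)
        for idx, doc in enumerate(corpus)
    )
    # nlargest returns (score, index) pairs sorted from best to worst.
    return heapq.nlargest(k, scored)

# Toy 3-dimensional "embedding" database (made-up values).
corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.2, 0.0]

top = exact_top_k(query, corpus, 2)
top_indices = [idx for _, idx in top]
print(top_indices)
```

The scan touches every vector once, so its cost grows linearly with the size of the database; that memory-bound scan is the kind of work the abstract proposes to accelerate.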
Free, publicly-accessible full text available March 30, 2026
Per-Bank Bandwidth Regulation of Shared Last-Level Cache for Real-Time Systems

https://doi.org/10.1109/RTSS62706.2024.00036

Sullivan, Connor; Manley, Alex; Alian, Mohammad; Yun, Heechul (December 2024, IEEE)

Full Text Available
Userspace Networking in gem5

https://doi.org/10.1109/ISPASS61541.2024.00026

Umeike, Johnson; Agarwal, Siddharth; Lazarev, Nikita; Alian, Mohammad (May 2024, IEEE)

Full-system simulation of computer systems is critical for capturing the complex interplay between various hardware and software components in future systems. Modeling the network subsystem is indispensable for the fidelity of full-system simulations due to the increasing importance of scale-out systems. Over the last decade, the network software stack has undergone major changes, with userspace networking stacks and data-plane networks rapidly replacing the conventional kernel network stack. Nevertheless, the current state-of-the-art architectural simulator, gem5, still employs kernel networking, which precludes realistic network application scenarios. In this work, we first demonstrate the limitations of gem5's current network stack in achieving high network bandwidth. Then, we enable a userspace networking stack on gem5. We extend gem5's NIC hardware model and device driver to support userspace device drivers running the DPDK framework. Additionally, we implement a network load generator hardware model in gem5 to generate various traffic patterns and perform per-packet timestamp and latency measurements without introducing packet loss. We develop a suite of six network-intensive benchmarks for stress testing the host network stack. These applications, based on DPDK, can run on both gem5 and real systems. Our experimental results show that enabling userspace networking improves gem5's network bandwidth by 6.3× compared with the current Linux kernel software stack. We characterize the performance of DPDK benchmarks running on both a real system and gem5, and evaluate the sensitivity of the applications to various system and microarchitecture parameters. This work marks the first step in refactoring the networking subsystem in gem5.
Full Text Available
SmartDIMM: In-Memory Acceleration of Upper Layer Protocols

https://doi.org/10.1109/HPCA57654.2024.00032

Patel, Neel; Mamandipoor, Amin; Nouri, Mohammad; Alian, Mohammad (March 2024, IEEE)

Full Text Available
XFM: Accelerated Software-Defined Far Memory

https://doi.org/10.1145/3613424.3623776

Patel, Neel; Mamandipoor, Amin; Quinn, Derrick; Alian, Mohammad (October 2023, ACM)

Full Text Available
In-Storage Domain-Specific Acceleration for Serverless Computing

https://doi.org/10.1145/3620665.3640413

Mahapatra, Rohan; Ghodrati, Soroush; Ahn, Byung Hoon; Kinzer, Sean; Wang, Shu-Ting; Xu, Hanyang; Karthikeyan, Lavanya; Sharma, Hardik; Yazdanbakhsh, Amir; Alian, Mohammad; et al (April 2024, ACM)

Full Text Available
Data Motion Acceleration: Chaining Cross-Domain Multi Accelerators

https://doi.org/10.1109/HPCA57654.2024.00083

Wang, Shu-Ting; Xu, Hanyang; Mamandipoor, Amin; Mahapatra, Rohan; Ahn, Byung Hoon; Ghodrati, Soroush; Kailas, Krishnan; Alian, Mohammad; Esmaeilzadeh, Hadi (March 2024, IEEE)

Full Text Available
Profiling gem5 Simulator

https://doi.org/10.1109/ISPASS57527.2023.00019

Umeike, Johnson; Patel, Neel; Manley, Alex; Mamandipoor, Amin; Yun, Heechul; Alian, Mohammad (April 2023, IEEE)

In this work, we set out to find the answers to the following questions: (1) Where are the bottlenecks in a state-of-the-art architectural simulator? (2) How much faster can architectural simulations run by tuning system configurations? (3) What are the opportunities in accelerating software simulation using hardware accelerators? We choose gem5 as the representative architectural simulator, run several simulations with various configurations, perform a detailed architectural analysis of the gem5 source code on different server platforms, tune both system and architectural settings for running simulations, and discuss the future opportunities in accelerating gem5 as an important application. Our detailed profiling of gem5 reveals that its performance is extremely sensitive to the size of the L1 cache. Our experimental results show that a RISC-V core with 32KB data and instruction cache improves gem5’s simulation speed by 31%-61% compared with a baseline core with 8KB L1 caches. Our paper is the first step toward building specialized hardware and software environments for accelerating software-based simulators.
IDIO: Network-Driven, Inbound Network Data Orchestration on Server Processors

https://doi.org/10.1109/MICRO56248.2022.00042

Alian, Mohammad; Agarwal, Siddharth; Shin, Jongmin; Patel, Neel; Yuan, Yifan; Kim, Daehoon; Wang, Ren; Kim, Nam Sung (October 2022, IEEE/ACM International Symposium on Microarchitecture (MICRO))

Full Text Available
